ABSTRACT

Linear Support Vector Machines (e.g., SVM$^{perf}$, Pegasos,
LIBLINEAR) are powerful and extremely efficient classification
tools when the datasets are very large and/or high-dimensional,
which is common in, e.g., text classification.
Minwise hashing is a popular technique in the context of
search for computing resemblance similarity between ultra
high-dimensional (e.g., $2^{64}$) data vectors such as document
representations using higher-order shingles. b-Bit minwise
hashing is a recent significant improvement over minwise
hashing that stores each hashed value using only the lowest
b bits (instead of 64 bits).

In this paper, we propose to (seamlessly) integrate b-bit min- 
wise hashing with linear SVM to substantially improve the 
training (and testing) efficiency using much smaller memory, 
with essentially no loss of accuracy. Theoretically, we prove 
that the resemblance matrix, the minwise hashing matrix, 
and the b-bit minwise hashing matrix are all positive definite 
matrices (kernels). However, since the resemblance kernel is 
non-linear, it appears not straightforward to use it for linear 
SVM. Interestingly, our proof for the positive definiteness of 
the b-bit minwise hashing kernel naturally suggests a simple 
strategy to integrate b-bit hashing with linear SVM, which 
only requires a very minimal modification of LIBLINEAR. 
Our technique is particularly useful when the data cannot
fit in memory, which is an increasingly critical issue in
large-scale machine learning.

Our preliminary experimental results on a publicly available 
webspam dataset (350K samples and 16 million dimensions) 
verified the effectiveness of our algorithm. For example, the 
training time was reduced to merely a few seconds. 

In addition, our technique can be easily extended to many
other linear and nonlinear machine learning applications (on
binary data) such as logistic regression. We will report
experimental results in subsequent manuscripts.

1. INTRODUCTION 

The method of b-bit minwise hashing [30, 31, 33] is a very recent
development for efficiently (in both time and space) computing
resemblances among extremely high-dimensional (e.g.,
$2^{64}$) binary vectors, which may be documents represented
by w-shingles with w = 5 or 7 [2, 3]. In this paper, we show
that b-bit minwise hashing can be seamlessly integrated with
linear Support Vector Machine (SVM) [10, 15, 18, 37, 40]. In
SIGKDD 2010, the nice work [40] addressed a critically important
problem about training linear SVM when the data
cannot fit in memory. In this paper, our work tackles
a similar problem from a different dimension.

1.1 Minwise Hashing 

The seminal work of minwise hashing [2, 3] has been successfully
applied to a very wide range of real-world problems,
especially in the context of search, including duplicate Web
page removal [SK^, text reuse in the Web [1], detection of 
large-scale redundancy in enterprise file systems [12] , syntac- 
tic similarity algorithms for enterprise information manage- 
ment [7], content matching for online advertising [36], Web 
graph compression [4], Web spam [T71[38], community ex- 
traction and classification in the Web graph W, compressing 
social networks [8], advertising diversification [13], wireless
sensor networks [19], graph sampling [35], and more.

Computing the size of set intersections is a fundamental 
problem in information retrieval, databases, and machine 
learning. For example, binary document vectors represented 
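using w-shingles can be viewed either as vectors in very high
dimensions or as sets. Given two sets, $S_1$ and $S_2$, where

\[ S_1, S_2 \subseteq \Omega = \{0, 1, 2, \ldots, D-1\}, \]

a widely used measure of similarity is the resemblance

\[ R = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = \frac{a}{f_1 + f_2 - a}, \quad \text{where } f_1 = |S_1|,\ f_2 = |S_2|,\ a = |S_1 \cap S_2|. \]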
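As a small illustration of the resemblance between two documents represented by w-shingles, the following sketch may help. The helper names `shingles` and `resemblance` are ours, not from any library, and real systems would typically hash each shingle to an integer in $\Omega$ rather than keep word tuples.

```python
def shingles(text, w):
    """Return the set of w-shingles (tuples of w contiguous words) in text."""
    words = text.split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(s1, s2):
    """Resemblance R = |S1 & S2| / |S1 | S2| of two sets."""
    union = s1 | s2
    if not union:
        return 0.0  # convention chosen here for two empty sets (an assumption)
    return len(s1 & s2) / len(union)


doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"
s1, s2 = shingles(doc1, 3), shingles(doc2, 3)
print(len(s1), len(s2), resemblance(s1, s2))
```

Changing a single word alters three of the seven 3-shingles in each document, so the two sets share 4 shingles out of 10 distinct ones.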



^First draft in Feb. 2011. Slightly modified in May 2011. 
Minwise hashing applies a random permutation $\pi : \Omega \to \Omega$
on $S_1$ and $S_2$. Based on an elementary probability result:

\[ \Pr\left(\min(\pi(S_1)) = \min(\pi(S_2))\right) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = R, \tag{1} \]

one can store the smallest elements under $\pi$, i.e., $\min(\pi(S_1))$
and $\min(\pi(S_2))$, and then repeat the permutation k times
to estimate R. After k minwise independent permutations,
$\pi_1, \pi_2, \ldots, \pi_k$, one can estimate R without bias, as:

\[ \hat{R}_M = \frac{1}{k} \sum_{j=1}^{k} 1\{\min(\pi_j(S_1)) = \min(\pi_j(S_2))\}, \tag{2} \]

\[ \mathrm{Var}\left(\hat{R}_M\right) = \frac{1}{k} R(1 - R). \tag{3} \]
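The estimator can be sketched in a few lines. This is only an illustration under simplifying assumptions: D is tiny, and each permutation is stored explicitly as a shuffled list, whereas practical systems use hash functions to simulate permutations. The function names are ours.

```python
import random


def minhash_signature(s, perms):
    """min(pi(S)) for each permutation pi, stored as a list mapping element -> rank."""
    return [min(pi[x] for x in s) for pi in perms]


def estimate_resemblance(sig1, sig2):
    """Fraction of permutations whose minima collide, as in estimator (2)."""
    matches = sum(1 for a, b in zip(sig1, sig2) if a == b)
    return matches / len(sig1)


D, k = 1000, 500
rng = random.Random(0)
perms = []
for _ in range(k):
    pi = list(range(D))
    rng.shuffle(pi)          # one explicit random permutation of {0, ..., D-1}
    perms.append(pi)

S1 = set(range(0, 300))
S2 = set(range(100, 400))
true_R = len(S1 & S2) / len(S1 | S2)   # 200 / 400
R_hat = estimate_resemblance(minhash_signature(S1, perms),
                             minhash_signature(S2, perms))
print(true_R, R_hat)
```

With k = 500 and R = 0.5, the standard deviation given by (3) is about 0.022, so the printed estimate should land close to 0.5.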

The common practice is to store each hashed value, e.g.,
$\min(\pi(S_1))$ and $\min(\pi(S_2))$, using 64 bits [11]. The storage
cost (and consequently the computational cost) will be
prohibitive in truly large-scale applications [34].

1.2 b-Bit Minwise Hashing 

The recent development of b-bit minwise hashing ^WD pro- 
vides a solution to the (storage and computational) problem 
of minwise hashing by storing only the lowest b bits (instead 
of 64 bits) of each hashed value for a small b. 

Again, consider two sets, $S_1, S_2 \subseteq \Omega = \{0, 1, 2, \ldots, D-1\}$.
Define the minimum values under the random permutation
$\pi : \Omega \to \Omega$ to be $z_1 = \min(\pi(S_1))$ and $z_2 = \min(\pi(S_2))$.

Define $e_{1,i}$ = $i$th lowest bit of $z_1$, and $e_{2,i}$ = $i$th lowest bit
of $z_2$. Theorem 1 provides an interesting probability result.



Interestingly, in this paper, we will show that we can use 
b-bit minwise hashing in linear SVM without directly using 
the estimator $\hat{R}_b$ in (8). We will prove that the resemblance
describes a family of positive definite kernels (hence naturally
suitable for SVM), which however is non-linear. More
interestingly, we will prove that the matrices generated by 
minwise hashing and b-bit minwise hashing are also positive 
definite. Furthermore, our proof directly suggests a simple 
implementation to use b-bit hashing with linear SVM, with 
only a very minimal modification of the original code. 

1.3 Ultra High-Dimensional Large Datasets 

In the context of search, a standard procedure to represent
documents (e.g., Web pages) is to use w-shingles (i.e., w
contiguous words), where w = 5 or 7 in several studies
[11]. This procedure can generate datasets of extremely high
dimensions.

For example, suppose we only consider $10^5$ common English
words. Using w = 5 may require the size of dictionary $\Omega$ to
be $D = |\Omega| = 10^{25} \approx 2^{83}$; and w = 7 requires $D \approx 2^{116}$.
In current practice, it appears that $D = 2^{64}$ often suffices, as the
number of available documents may not be large enough to
exhaust the dictionary. However, as the Web continues to
grow at a fast rate, it may be possible that we have to use
$D > 2^{64}$ in the near future.



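Theorem 1. [31] Assume D is large. Then

\[ P_b = \Pr\left( \prod_{i=1}^{b} 1\{e_{1,i} = e_{2,i}\} = 1 \right) = C_{1,b} + (1 - C_{2,b}) R, \tag{4} \]

where $r_1 = f_1/D$, $r_2 = f_2/D$, $f_1 = |S_1|$, $f_2 = |S_2|$, and

\[ C_{1,b} = A_{1,b} \frac{r_2}{r_1 + r_2} + A_{2,b} \frac{r_1}{r_1 + r_2}, \tag{5} \]

\[ C_{2,b} = A_{1,b} \frac{r_1}{r_1 + r_2} + A_{2,b} \frac{r_2}{r_1 + r_2}, \tag{6} \]

\[ A_{1,b} = \frac{r_1 [1 - r_1]^{2^b - 1}}{1 - [1 - r_1]^{2^b}}, \qquad A_{2,b} = \frac{r_2 [1 - r_2]^{2^b - 1}}{1 - [1 - r_2]^{2^b}}. \tag{7} \]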



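Once the basic probability formula is known, one can repeat
the permutations k times to estimate $P_b$ in (4), from which
one can estimate the resemblance R. That is,

\[ \hat{R}_b = \frac{\hat{P}_b - C_{1,b}}{1 - C_{2,b}}, \tag{8} \]

\[ \hat{P}_b = \frac{1}{k} \sum_{j=1}^{k} \left\{ \prod_{i=1}^{b} 1\{e_{1,i,\pi_j} = e_{2,i,\pi_j}\} = 1 \right\}. \tag{9} \]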



[31] carefully analyzed the variance of $\hat{R}_b$ and compared it
with the variance of the original minwise hashing estimator
$\hat{R}_M$. The result is encouraging. To estimate any R > 0.5,
even in the least favorable situation, using b = 1 only requires
increasing the number of permutations by a factor
of 3 in order to achieve the same estimation variance as
$\hat{R}_M$ (which uses b = 64 bits). Therefore, a 21.3-fold (64/3)
improvement in storage is attained if one is mainly interested in
resemblance R > 0.5.
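A minimal sketch of the b-bit estimator follows. It assumes the constants $C_{1,b}$ and $C_{2,b}$ take the form given in Theorem 1 of [31], namely $C_{1,b} = A_{1,b} r_2/(r_1+r_2) + A_{2,b} r_1/(r_1+r_2)$ and $C_{2,b} = A_{1,b} r_1/(r_1+r_2) + A_{2,b} r_2/(r_1+r_2)$ with $A_{j,b} = r_j[1-r_j]^{2^b-1}/(1-[1-r_j]^{2^b})$. The function names are ours, and the signatures are assumed to be lists of minwise hashed values from the same k permutations.

```python
def A(r, b):
    """A_{.,b} = r [1 - r]^(2^b - 1) / (1 - [1 - r]^(2^b))."""
    p = 2 ** b
    return r * (1.0 - r) ** (p - 1) / (1.0 - (1.0 - r) ** p)


def constants(f1, f2, D, b):
    """C_{1,b} and C_{2,b}, assuming the form stated in the lead-in."""
    r1, r2 = f1 / D, f2 / D
    a1, a2 = A(r1, b), A(r2, b)
    c1 = a1 * r2 / (r1 + r2) + a2 * r1 / (r1 + r2)
    c2 = a1 * r1 / (r1 + r2) + a2 * r2 / (r1 + r2)
    return c1, c2


def estimate_R_b(sig1, sig2, f1, f2, D, b):
    """Estimate R from the lowest b bits of each pair of hashed values."""
    mask = (1 << b) - 1
    # All b lowest bits agree exactly when the masked values are equal.
    matches = sum(1 for z1, z2 in zip(sig1, sig2) if (z1 & mask) == (z2 & mask))
    P_hat = matches / len(sig1)
    c1, c2 = constants(f1, f2, D, b)
    return (P_hat - c1) / (1.0 - c2)


c1, c2 = constants(1000, 1000, 2 ** 30, 1)
print(c1, c2)
```

For very sparse data ($r_1, r_2 \to 0$), both constants approach $1/2^b$, which reflects the chance that two unrelated hashed values agree on their lowest b bits.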



With $w \geq 5$, normally only the absence/presence (0/1)
information is used, as a w-shingle is unlikely to occur more
than once in a page. The total number of shingles is usually
set to be $|\Omega| = 2^{64}$. Thus, the set intersection
becomes the inner product of binary data vectors in $2^{64}$
dimensions. Interestingly, even when the data are not too
high-dimensional, empirical studies [51 ll4|[T5] achieved good 
performance using SVM with binary-quantized (text or im- 
age) data. 

The webspam dataset, which can be downloaded from the
LibSVM site and was used in [40], is among the largest public
classification datasets. It consists of 350,000 documents
represented by 3-shingles in 16,609,143 dimensions. The
average number of non-zeros per sample is about 3730. We
will mainly use the webspam dataset to verify our proposed
algorithm. We expect that the use of higher-order shingles
(e.g., $w \geq 5$) in machine learning will become more common
after we demonstrate the effectiveness of our algorithm,
which naturally integrates linear SVM with b-bit hashing.

2. LINEAR SVM 

Linear SVMs have become very powerful and extremely popular.
Representative software packages include SVM$^{perf}$ [18],
Pegasos [37], and LIBLINEAR [10]. The 2008 PASCAL
Large Scale Learning Challenge compared various implementations
of linear SVM and identified LIBLINEAR as the winner.


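Given a dataset $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathbb{R}^D$, $y_i \in \{-1, 1\}$, SVM
solves the following optimization problem (primal):

\[ \min_{w} \ \frac{1}{2} w^T w + C \sum_{i=1}^{n} \max\left\{1 - y_i w^T x_i,\ 0\right\}, \tag{10} \]

where $C > 0$ is a penalty parameter. It is often more
convenient to solve the dual problem:

\[ \min_{\alpha} \ f(\alpha) = \frac{1}{2} \alpha^T Q \alpha - e^T \alpha, \quad \text{subject to } 0 \leq \alpha_i \leq C,\ i = 1, 2, \ldots, n, \tag{11} \]

where $Q_{ij} = y_i y_j x_i^T x_j$ and $e \in \mathbb{R}^n$ is an all-one vector.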

Our implementation will be based on the LIBLINEAR package
[10], which implements linear SVM using the dual coordinate
descent algorithm [15]; see Algorithm 1.



Algorithm 1 The dual coordinate descent method for linear
SVM [15]. We modified it for only L1 linear SVM, which was
considered in [40].

• Given $\alpha$ and $w = \sum_i y_i \alpha_i x_i$.

• While $\alpha$ is not optimal
  For i = 1 to n

  1. $\bar{\alpha}_i \leftarrow \alpha_i$

  2. $G = y_i w^T x_i - 1$

  3. $PG = \min(G, 0)$ if $\alpha_i = 0$;
     $PG = \max(G, 0)$ if $\alpha_i = C$;
     $PG = G$ if $0 < \alpha_i < C$

  4. If $|PG| \neq 0$,

     $\alpha_i \leftarrow \min(\max(\alpha_i - G/Q_{ii}, 0), C)$
     $w \leftarrow w + (\alpha_i - \bar{\alpha}_i) y_i x_i$
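The steps of Algorithm 1 can be sketched in plain Python as follows. This is an illustrative toy version for small dense data, not the LIBLINEAR implementation: the stopping rule (largest projected gradient below `tol`), the random visiting order, and the skipping of all-zero rows are our assumptions.

```python
import random


def dual_cd_l1_svm(X, y, C=1.0, max_epochs=1000, tol=1e-6, seed=0):
    """Dual coordinate descent for L1-loss linear SVM (dense lists, no bias term)."""
    n, d = len(X), len(X[0])
    alpha = [0.0] * n
    w = [0.0] * d                      # w = sum_i y_i alpha_i x_i, zero at the start
    Qii = [sum(v * v for v in x) for x in X]   # Q_ii = x_i^T x_i because y_i^2 = 1
    order = list(range(n))
    rng = random.Random(seed)
    for _ in range(max_epochs):
        rng.shuffle(order)             # assumption: visit examples in random order
        max_pg = 0.0
        for i in order:
            if Qii[i] == 0.0:
                continue               # assumption: skip all-zero rows
            G = y[i] * sum(wj * xj for wj, xj in zip(w, X[i])) - 1.0
            if alpha[i] <= 0.0:
                PG = min(G, 0.0)
            elif alpha[i] >= C:
                PG = max(G, 0.0)
            else:
                PG = G
            max_pg = max(max_pg, abs(PG))
            if PG != 0.0:
                old = alpha[i]
                alpha[i] = min(max(old - G / Qii[i], 0.0), C)
                step = (alpha[i] - old) * y[i]
                w = [wj + step * xj for wj, xj in zip(w, X[i])]
        if max_pg < tol:
            break
    return w, alpha


X = [[1, 0], [1, 0], [0, 1], [0, 1]]
y = [1, 1, -1, -1]
w, alpha = dual_cd_l1_svm(X, y, C=1.0)
print(w, alpha)
```

On this toy problem the dual optimum has $\alpha_1 + \alpha_2 = 1$ and $\alpha_3 + \alpha_4 = 1$, which gives w = (1, -1).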



2.1 The Memory Bottleneck 

When the data can fit in memory, linear SVM is often ex- 
tremely efficient after the data are loaded into the memory. 
It is however often the case that the data loading time domi- 
nates the computing time for solving the SVM problem [40] . 

A much more severe problem arises when the data can not fit 
in memory. This situation can be very common in practice. 
The publicly available webspam dataset needs about 24GB 
disk space, which exceeds the memory capacity of many 
desktop PCs. Note that webspam contains only 350,000 doc- 
uments represented by 3-shingles. Practical applications, 
however, may involve hundreds of millions (or billions) of 
Web pages represented by (e.g.,) 5-shingles. 

2.2 Block Linear SVM 

[40] proposed solving the memory bottleneck by block linear
SVM. Basically, they partitioned the data into m blocks, with m
calculated according to the available memory space. At each
step, they loaded one block into memory to update the
coefficients $\alpha$ or $w$ (which are assumed to reside in memory).
When working with each block of data, they actually used
the Pegasos [37] procedure internally.

While the block linear SVM algorithm provides a nice so- 
lution to the urgent practical problem, it does not appear 
to be very easy to implement. Moreover, the computational 
bottleneck is still at the memory because loading the data 



blocks for many iterations consumes a large number of disk
I/Os.

The authors of [40] conducted thorough experiments on three
datasets. The largest dataset is webspam. Their experiments 
were conducted on a machine with only 1GB memory so that 
they could better investigate the impact of data splitting 
(such as block size) on the performance. 

2.3 A Brief Introduction of Our Proposal 

We propose a very different solution by using b-bit minwise
hashing. We assume the data vectors are very high-dimensional
and relatively very sparse. For example, if the
dimension $D = 2^{64}$, then even a set of size $2^{54}$ (which
corresponds to the equivalent of the amount of text in a small
novel, when represented via shingles) will be relatively very
sparse because $2^{-10} \approx 0.001$, even though in absolute
magnitude $2^{54}$ is a very large number.

We also consider that the data are binary, which as pre- 
viously explained is a reasonable assumption in important 
practical scenarios. In fact, our experiments on webspam will 
show that binary quantizing that dataset essentially does not 
affect the accuracy. 

With the above assumptions, we can apply b-bit minwise 
hashing on the dataset to obtain a very compact represen- 
tation of the original data. Suppose we conduct k permu- 
tations and store the lowest b bits for each hashed (i.e., the 
minimum) value, then the total storage is only nbk bits. 

For example, for the webspam dataset (n = 350000) with
b = 8 and k = 200, the total storage is only 70 MB.
However, in order to use b-bit minwise hashing for linear 
SVM, we have to solve the following two problems: 

• If we use b-bit minwise hashing to estimate the resem- 
blance, which (we will soon prove) represents a posi- 
tive definite nonlinear kernel, how can we effectively 
convert this nonlinear problem into a linear problem? 

• We need to prove that the matrices generated by b-bit 
minwise hashing are indeed positive definite, which will 
provide the solid foundation for our proposed solution. 

It turns out that our proof in the next section that b-bit 
hashing matrices are positive definite naturally provides the 
construction for converting the otherwise nonlinear SVM 
problem into linear SVM. 

Clearly, our method is not really a competitor of the
approach in [40]. In fact, both approaches
may work together to solve extremely large problems. For
example, suppose k = 200 and b = 8, then one billion ($10^9$)
documents may require 200 GB memory, which may still
exceed the capacity of most workstations. In this case, we can
still apply the block linear SVM using the hashed data.
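The two storage figures quoted in this section follow directly from the nbk-bit count. A small sketch of the arithmetic (the helper name is ours, and we assume decimal units, 1 MB = $10^6$ bytes and 1 GB = $10^9$ bytes):

```python
def hashed_storage_bytes(n, b, k):
    """Bytes needed to store n examples, k permutations, b bits per hashed value."""
    return n * b * k / 8


webspam_mb = hashed_storage_bytes(350_000, 8, 200) / 1e6   # webspam, in megabytes
billion_gb = hashed_storage_bytes(10 ** 9, 8, 200) / 1e9   # one billion documents, in gigabytes
print(webspam_mb, billion_gb)
```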

3. B-BIT MINWISE HASHING KERNELS 

This section proves some theoretical properties of matrices 
generated by resemblance, minwise hashing, or b-bit min- 
wise hashing. We will show that they are all positive defi- 
nite matrices (kernels). Our proof not only provides a solid 



theoretical foundation for using b-bit hashing in SVM, but 
also illustrates the ideas behind the construction required 
for integrating linear SVM with b-bit hashing. 



Definition: A symmetric $n \times n$ matrix K satisfying

\[ \sum_{i,j} c_i c_j K_{ij} \geq 0 \]

for all real vectors c is called positive definite (PD). Note
that here we do not differentiate PD from nonnegative definite,
following the convention in the machine learning literature.



Consider n sets $S_1, S_2, \ldots, S_n \subseteq \Omega = \{0, 1, \ldots, D-1\}$. Apply
one permutation $\pi$ to each set and define $z_i = \min\{\pi(S_i)\}$.



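We will prove that the following three matrices are all PD.

• The resemblance matrix $R \in \mathbb{R}^{n \times n}$, whose $(i, j)$-th entry
is the resemblance between set $S_i$ and set $S_j$:

\[ R_{ij} = \frac{|S_i \cap S_j|}{|S_i \cup S_j|} \tag{12} \]

• The minwise hashing matrix $M \in \mathbb{R}^{n \times n}$:

\[ M_{ij} = 1\{z_i = z_j\} \tag{13} \]

• The b-bit minwise hashing matrix $M^{(b)} \in \mathbb{R}^{n \times n}$:

\[ M^{(b)}_{ij} = \prod_{t=1}^{b} 1\{e_{i,t} = e_{j,t}\} \tag{14} \]

where $e_{i,t}$ is the t-th lowest bit of $z_i$.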

Our proof follows a basic principle: a matrix A
is PD if it can be written as an inner product $B^T B$.



Theorem 2. The minwise hashing matrix $M \in \mathbb{R}^{n \times n}$
defined by (13) is PD.

Proof: We can write

\[ M_{ij} = 1\{z_i = z_j\} = \sum_{t=0}^{D-1} 1\{z_i = t\} \times 1\{z_j = t\}. \]

Therefore, $M_{ij}$ is the inner product of two high-dimensional
vectors of length D and hence M can be written as an inner
product $M = B^T B$, where $B \in \mathbb{R}^{D \times n}$. This completes the
proof. □

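Theorem 3. The b-bit minwise hashing matrix $M^{(b)} \in
\mathbb{R}^{n \times n}$ defined by (14) is PD.

Proof: Let $z_i^{(b)} \in \{0, 1, \ldots, 2^b - 1\}$ denote the integer formed by
the lowest b bits of $z_i$. We can write

\[ M^{(b)}_{ij} = \prod_{t=1}^{b} 1\{e_{i,t} = e_{j,t}\} = \sum_{t=0}^{2^b - 1} 1\{z_i^{(b)} = t\} \times 1\{z_j^{(b)} = t\}. \]

Therefore, $M^{(b)}_{ij}$ is the inner product of two $2^b$-dimensional
vectors. This completes the proof. □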



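Theorem 4. The resemblance matrix $R \in \mathbb{R}^{n \times n}$ defined
by (12) is PD.

Proof: The proof easily follows from the fact that $R_{ij} =
\Pr\{M_{ij} = 1\} = E(M_{ij})$ and $M_{ij}$ is the $(i, j)$-th element of
the PD matrix M: for any vector c, $\sum_{i,j} c_i c_j R_{ij} =
E\left(\sum_{i,j} c_i c_j M_{ij}\right) \geq 0$. □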



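One might be wondering if we need to worry about the fact
that there are k permutations instead of just one $\pi$. For
example, there will be k minwise hashing matrices: $M_{(s)}$,
s = 1 to k. Note that the summation $\sum_{s=1}^{k} M_{(s)}$ is still PD
since

\[ c^T \left[ \sum_{s=1}^{k} M_{(s)} \right] c = \sum_{s=1}^{k} c^T M_{(s)} c \geq 0 \]

for any vector c by the fact that $M_{(s)}$ is PD.
Similarly, the average satisfies $c^T \left[ \frac{1}{k} \sum_{s=1}^{k} M_{(s)} \right] c \geq 0$.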



Note that the elements of a PD matrix satisfy the triangle
inequality while the converse is not necessarily true. In fact, it
is well-known that $R_{ij}$ satisfies the triangle inequality [2, 6],
although, to the best of our knowledge, we have not seen a
direct proof that the resemblance matrix R is PD.



On the other hand, the fact that R is PD does not seem
to help us too much for efficient SVM training, because the
resemblance is a nonlinear operation. However, the proof
that the b-bit minwise hashing matrix $M^{(b)}$ is PD provides
us with a very simple strategy to construct a matrix B such
that $M^{(b)} = B^T B$, where B has dimensions only $2^b \times n$. As
long as b is not too large, this provides a highly affordable
way to expand each minwise hashed value using b bits.
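The construction can be checked numerically on a handful of hashed values. In the sketch below (helper names are ours), each column of B is a one-hot vector of length $2^b$ indexed by the lowest b bits of $z_i$, and the Gram matrix $B^T B$ is compared entry by entry with the b-bit matrix computed directly from its definition.

```python
def bbit_matrix(z, b):
    """Entry (i, j) is 1 if the lowest b bits of z_i and z_j agree, else 0."""
    mask = (1 << b) - 1
    return [[1 if (zi & mask) == (zj & mask) else 0 for zj in z] for zi in z]


def one_hot_columns(z, b):
    """Columns of B: for each z_i, a 2^b-vector with a single 1 at index (z_i mod 2^b)."""
    mask = (1 << b) - 1
    cols = []
    for zi in z:
        col = [0] * (1 << b)
        col[zi & mask] = 1
        cols.append(col)
    return cols


def gram(cols):
    """B^T B computed from the columns of B."""
    return [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]


z = [12013, 25964, 20191, 12017]   # arbitrary example hashed values
b = 2
M_b = bbit_matrix(z, b)
G = gram(one_hot_columns(z, b))
print(M_b == G)
```

Because a matrix of the form $B^T B$ satisfies $c^T B^T B c = \|Bc\|^2 \geq 0$, agreement between the two matrices is exactly what the proof of Theorem 3 relies on.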

4. INTEGRATING LINEAR SVM WITH B-BIT MINWISE HASHING

Using the construction in the proofs of Theorems 2 and 3,
our algorithm for integrating b-bit hashing with linear SVM
becomes extremely simple (at least in retrospect).

Given a dataset $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is a D-dimensional
binary data vector (which is equivalent to a set), we apply
k independent random permutations on each $x_i$ and store
the lowest b bits of each hashed value. This way, we obtain
a new dataset which can be stored using merely nbk bits.
At run-time, however, we need to expand each new data
point into a vector of length $2^b \times k$.

For example, suppose k = 3 and the hashed values are originally
{12013, 25964, 20191}, whose binary digits are
{010111011101101, 110010101101100, 100111011011111}.

Consider b = 2. Then the binary digits are stored as {01, 00, 11}
(which corresponds to {1, 0, 3} in decimals). At run-time,
we need to expand them into a vector of length $2^b k = 12$,
to be

{0,0,1,0, 0,0,0,1, 1,0,0,0}



which will be the new $x_i$ fed to a linear SVM solver.
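The expansion in the worked example can be sketched as below. The function name `expand` is ours, and the position of the 1 inside each block follows the ordering used in the example above (value v sets index $2^b - 1 - v$); any fixed ordering would serve equally well.

```python
def expand(hashed_values, b):
    """Expand k hashed values into a binary vector of length 2^b * k.

    Each hashed value contributes one block of 2^b entries containing a single 1,
    placed according to the lowest b bits of that value.
    """
    size = 1 << b
    mask = size - 1
    x = []
    for z in hashed_values:
        block = [0] * size
        block[size - 1 - (z & mask)] = 1   # ordering chosen to match the worked example
        x.extend(block)
    return x


x_new = expand([12013, 25964, 20191], b=2)
print(x_new)   # [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0]
```

Each expanded vector has exactly k non-zeros, so in practice one would likely store it in a sparse format rather than materialize all $2^b k$ entries.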



We have very slightly modified LIBLINEAR [10] to incorporate
b-bit hashing. Since we really only need small b values
such as b = 8, both the training and testing become very
efficient on the hashed dataset, as verified by experiments.

5. EXPERIMENTAL RESULTS ON WEBSPAM 

Our experiment settings follow the work in SIGKDD 2010 [40]
very closely. They conducted experiments on three datasets:
(i) the Yahoo-Korean dataset, which is proprietary, (ii) the epsilon
dataset, which is completely dense and low-dimensional, and (iii)
the webspam dataset, which is the largest among the three and
reasonably high-dimensional (n = 350000, D = 16609143). Therefore,
our experiments focus on the webspam dataset.

Following [40], we randomly selected 20% of the samples for
testing and used the remaining 80% for training.

5.1 Binary vs. Real-Value Data

Our current implementation is only for binary data, which
is probably the most important case when documents are
represented by w-shingles with $w \geq 5$. Even for webspam,
which only used w = 3, we notice that binary quantization
of the dataset does not really affect the classification results.

Since there is a tuning parameter C, we conducted the ex- 
periments on a series of C values ranging from 0.001 to 100. 
Figure 1 presents the results in terms of the number of support
vectors (nSV), the testing accuracy (%), the training
time (seconds), and the testing time (seconds), for both the 
original dataset and the binary-quantized dataset. 

Note that, although we plot the results as functions of C 
values, we do not intend to say that the results are directly 
comparable at a given C. There are two issues. Firstly, the 
two datasets will have different scales and hence their opti- 
mal C values may be quite different. Secondly, in practice, 
we will conduct cross-validations to find the optimal C for 
the best classifiers. 

Since our purpose is to demonstrate the effectiveness of our 
proposed linear SVM scheme using b-bit hashing, we simply 
provide results for all C values and we assume that the best 
performance is achievable if we conduct cross-validations. 

Clearly, Figure 1 illustrates that binary quantization on webspam
does not degrade the performance in any aspect.

5.2 Evaluations of Testing Accuracy 

Figures 2 (average) and 3 (std, standard deviation) provide
the test accuracies. We experimented with k = 30 to k =
500, although prior practice [2, 3, 31] suggested that k = 100
to k = 200 should provide good results. We let b = 1, 2, 4,
8, and 16.

Since our method is a randomized algorithm, we repeat every
experiment 50 times. We report both the mean and std
values. Figure 3 illustrates that the standard deviations are
very small, especially with $b \geq 4$ (< 0.1%). Figure 2 demonstrates
that using $b \geq 8$ and $k \geq 100$ achieves about the
same test accuracies as using the original data.



[Figure 1 panels: "Webspam: nSV", "Webspam: Accuracy", "Webspam: Train time", and "Webspam: Test time", each plotted against C for the Original and Binary datasets.]



Figure 1: The purpose is to show that with binary
quantization, the performance of linear SVM does
not degrade. Note that we did not re-normalize the
quantized data (to have unit norm) for this particular
experiment. According to private communications with
the authors of LIBLINEAR, they usually normalize
the data, even for binary data. Therefore,
our future reports will always normalize the data.

5.3 Evaluations of Training Time 

Compared with the original training time (about 100 to 200
seconds in Figure 1), we can see from Figure 4 that only a
very small fraction of the original cost is needed using our
method.

Note that we did not include the data loading time for either
the original method or our new method. Loading the original
data took about 12 minutes while loading the hashed
data took only about 10 seconds. Of course, there is a cost
for processing (hashing) the data, which we find is very ef- 
ficient, confirming prior studies [5]. In fact, data processing 
can be conducted during load collection, as the standard 
practice in search. 

5.4 Evaluations of Testing Time 

Compared with the original testing time (about 150 to 200
seconds in Figure 1), we can see from Figure 5 that only a
very small fraction of the original cost is needed using our
method (only a few seconds).

Note that the testing time includes both the data loading 
time and computing time, as designed by LIBLINEAR. The 
efficiency of testing may be very important in practice, for 
example, when the classifier is deployed in a user-facing
application (such as search), while the cost of training or 
pre-processing (such as hashing) may be less critical and 
can often be conducted off-line. 

6. RELATED WORK 

In this paper, we focus on describing our method for significantly
improving linear SVM in high-dimensional binary
datasets, which are common in (commercial) text applications.
As many integer data can be transformed into binary
data by (significantly) increasing the dimensions, our
method is actually quite general. In fact, our method can
be easily extended to many other linear and nonlinear learning
algorithms such as logistic regression. We focus on
linear SVM partly because other learning methods such as
tree-based algorithms (which are also extremely popular in
industry) are not particularly suitable for extremely
high-dimensional (e.g., $2^{64}$) data. See one of the authors' recent
work on abc-boost [22, 23], which had detailed comparisons
with (kernel) SVM and deep learning on a variety of
not-too-high-dimensional datasets.

Figure 2: Test Accuracy. With $k \geq 100$ and $b \geq 8$,
b-bit hashing achieves very similar accuracies as using
the original (binary-quantized) data. The results are
averaged over 50 repetitions.

In the past years, we have been working on a variety of
hashing/sketching/sampling methods to deal with extremely
large-scale high-dimensional data. Examples are normal
and normal-like random projections [27, 28], stable random
projections [20, 21, 26, 29], Conditional Random Sampling
(CRS) [24, 25], as well as b-bit minwise hashing [30, 31, 33].
Many of those papers used (kernel) SVM as the motivating
applications. Currently, we focus on b-bit minwise hashing
because that method appears to be the state-of-the-art
algorithm for binary data and now (as in this paper) we have
discovered the simple and powerful technique to apply it to
many large-scale learning problems.

Figure 3: Test Accuracy (STD). The standard deviations
are computed from 50 repetitions. When
$b \geq 8$, the standard deviations become extremely
small (e.g., 0.02%). This means our randomized
algorithm produces very stable results.

Recently, a highly interesting hashing method was developed
also for efficient SVM training [39], which reported the
identical estimation variance as the special case in one of
our earlier random projection papers [28] (by using "s = 1"
in [28]). Note that for randomized algorithms, it is
essentially the variance which controls the storage size and
algorithm complexity.

To compare b-bit minwise hashing with random projections
(including [39], which reported the same variance as random
projections [28]), a report [32] was written to compare their
variances. [31] reported the theoretical comparisons, illustrating
that b-bit minwise hashing improves random projections
often by 10- to 100-fold. In other words, to achieve the
same accuracy, random projections would often require 10
to 100 times more storage than b-bit minwise hashing. This
is a substantially large difference and should be noted by
researchers and practitioners in large-scale machine learning.

^We appreciate John Langford, one of the authors of [39], for
the highly helpful communications.

Figure 4: Training time. Compared with the training
time of using the original data in Figure 1, we
can see that our method with b-bit hashing only
needs a very small fraction of the original cost. The
bottom two panels plot the standard deviations.

7. CONCLUSION 

We develop a simple and very efficient scheme to seamlessly
integrate b-bit minwise hashing with linear SVM (in
particular, LIBLINEAR). Our method requires only a very
small modification of the original code. Experiments demonstrate
that our proposed method is very effective, using
much smaller memory and less training time, to achieve
essentially the same test accuracy. The testing stage also
becomes much more efficient, which may be highly beneficial
in important real-world applications such as search.

Figure 5: Testing time. Compared with the original
testing time (about 150 to 200 seconds in Figure 1),
only a very small fraction of the original cost
is needed using our method. The bottom two panels
plot the standard deviations.

Our proposed method provides an elegant and simple solution
when the datasets do not fit in memory. In [40], the authors
assumed the memory limit was only 1GB and hence they
had to load the data in blocks (for multiple passes), incurring
high I/O costs. Note that, as we conducted our experiments
on a workstation with 48 GB memory, we always loaded
the entire dataset since the original (webspam) data size 
(about 24 GB) did not exceed the memory capacity. We ex- 
pect that (commercial) applications may often involve high- 
dimensional datasets on the TB scale (or even much larger) 
and hence our method will have significant advantages for 
those applications. 